Abstract

We present an approach for vehicle classification in IR video
sequences by integrating detection, tracking and recognition. The
method has two steps. First, the moving target is automatically
detected using a detection algorithm. Next, we perform simultaneous
tracking and recognition using an appearance-model based particle
filter. The tracking result is evaluated at each frame. Low confidence
in tracking performance initiates a new cycle of detection, tracking
and classification. We demonstrate the robustness of the proposed
method using outdoor IR video sequences.

1. Introduction

Recently, video-based vehicle classification has gained much
attention, especially in automatic traffic management, surveillance
and battlefield awareness. Typically, detection and tracking are
solved before classification. In Lipton et al. (1998), a tracking and
classification system is described that can categorize moving objects
as vehicles or humans. However, it does not further classify the
vehicles into various classes. Wu et al. (2001) use a parameterized
model and neural networks for vehicle classification. In Gupte et al.
(2002), vehicles are modeled as rectangular patches with certain
dynamic behavior and Kalman filtering is used to estimate the vehicle
parameters. In Roller (1993), an object classification approach that
uses parameterized 3D models is described. The system uses a 3D
polyhedral model to classify vehicles in a traffic sequence. In
Kagesawa et al. (2001), a method for recognizing a vehicle’s make and
model is proposed. It first creates a compressed database of local
features of target vehicles from training images and then matches
them with the local features of the probe image for recognition.

In this paper, we tackle the problem of vehicle classification by
integrating detection, tracking and recognition. In our system, the
moving vehicle is automatically detected, tracked and recognized
without any interruptions. The flow chart of our system is shown in
Fig. 1. The video sequences are input to our system. The moving target
is detected using temporal variance analysis. The target is tracked
and classified simultaneously using an appearance model and mixtures
of probabilistic principal component analysis (PPCA) (Tipping and
Bishop, 1999). Evaluation of the tracking performance is performed at
each frame. If the performance falls below some threshold, the cycle
of detection, tracking and classification is re-initiated; otherwise
the tracking and classification propagate to the next frame.

Figure 1: A flow chart of our system.

There are four types of vehicles used in the experiment: ‘m60’,
‘brdm’, ‘wetting’ and ‘bmp’. Four probe video sequences, each of which
contains a different vehicle, are used for classification. Fig. 2
shows two image samples from the probe video sequence ‘bmp1’.
Fig. 2(a) shows the side view of ‘bmp’ and Fig. 2(b) the frontal view.
The target-to-background contrast is very low in the IR images, which
makes detecting and tracking the moving target much more difficult.

Unlike the method of Zhou et al. (2003), which manually selects the
moving target in the first frame, we select it automatically using a
detection algorithm. Because of the presence of smoke and dust in IR
videos, as shown in Fig. 2, it is hard to obtain a tight rectangular
bounding box from the detection algorithm. Consequently, the tracker
drifts quickly. This creates a need to evaluate the tracking
performance. The evaluation generates a confidence measure, and
detection is restarted once the tracking confidence falls below a
threshold.

We use mixtures of PPCA (Tipping and Bishop, 1999) for appearance
modeling. We then compute the posterior probability of finding the
appearance of each object in the given video and assign the label
corresponding to the maximum.

The rest of this paper is organized as follows. Section 2 describes
the detection algorithm. Section 3 describes the tracking and
classification algorithm. Section 4 details the simultaneous
evaluation of the tracking, and Section 5 describes the experiments.
Finally, conclusions and future work are discussed in Section 6.

2. Target Detection

Detection plays an important role in our system. It is a prerequisite
for tracking: it gives an initial bounding box surrounding the target
and re-initializes the target if the tracking confidence measure is
low.

Given a video sequence $\{I_i\}$, we set $m_1 = I_1$ and
$mv_1 = I_1 \times I_1$. The operator $\times$ is the
element-by-element product of two matrices. The subsequent $m_i$,
$mv_i$ and $imvar_i$ are defined as

$$m_i = \{(N - 1) \, m_{i-1} + I_i\}/N, \tag{1}$$

$$mv_i = \{(N - 1) \, mv_{i-1} + I_i \times I_i\}/N, \tag{2}$$

$$imvar_i = \sqrt{mv_i - m_i \times m_i}, \tag{3}$$

where $N$ is the window size for detection, which is 150 in our
experiment.

For each element $p(i, j)$ in $imvar_i$, we set $p(i, j) = 1$ if
$p(i, j) > T$ and $p(i, j) = 0$ otherwise, where $T$ is the threshold.
This converts $imvar_i$ to a binary image, which we call the variance
image. We then select the rectangular bounding box for the moving
target by locating the pixels with $p(i, j) = 1$ in the image.
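
The recursion in Eqs. (1)-(3) and the thresholding step can be
sketched in Python with NumPy as follows. This is a minimal
illustration under stated assumptions, not the implementation used in
the experiments: the function name, the default threshold value and
the (top, left, bottom, right) box convention are hypothetical.

```python
import numpy as np

def detect_moving_target(frames, window=150, threshold=10.0):
    """Bounding box (top, left, bottom, right) of high temporal variance, or None."""
    frames = iter(frames)
    first = np.asarray(next(frames), dtype=np.float64)
    mean = first.copy()          # m_1 = I_1
    mean_sq = first * first      # mv_1 = I_1 x I_1 (element-wise)
    for frame in frames:
        frame = np.asarray(frame, dtype=np.float64)
        mean = ((window - 1) * mean + frame) / window                # Eq. (1)
        mean_sq = ((window - 1) * mean_sq + frame * frame) / window  # Eq. (2)
    # Eq. (3); the clip guards against tiny negative values from rounding.
    imvar = np.sqrt(np.clip(mean_sq - mean * mean, 0.0, None))
    mask = imvar > threshold     # binary variance image
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])
```

The box is simply the extent of all pixels set to 1, which is why
smoke or dust with high temporal variance enlarges it.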

Figures 3 and 4 show the detection results for ‘brdm’ and ‘m60’
respectively. In Fig. 3, the bounding box is very large because of the
smoke emitted by the vehicle. In Fig. 4, the similarity between the
environment and the target affects the bounding box localization.

3. Target Tracking and Classification

This section describes the vehicle tracking and classification
algorithm. Section 3.1 describes the state space model used for
tracking and classification. Tracking and classification are
implemented simultaneously by estimating the posterior distribution.
Section 3.2 details the mixtures of PPCA algorithm used to estimate
the distribution of the identity variable for classification.

3.1. State Space Model

A time series state space model uses the state variable
$x_t = \{n_t, \theta_t\}$, which includes the identity variable $n_t$
and the 2D affine transformation motion parameters $\theta_t$. The
system equation is written as

$$n_t = n_{t-1}, \quad \theta_t = \theta_{t-1} + u_t, \quad t \geq 1, \tag{4}$$

where we assume that the motion variable follows a Markov process with
$u_t$ as a white Gaussian noise process. $n_t \in \mathcal{N} =
\{1, 2, \ldots, N\}$ indexes the gallery set
$\{I_1, I_2, \ldots, I_N\}$.

A simple formulation of the observation equation can be characterized
as

$$Z_t = T\{Y_t, \theta_t\} = I_{n_t} + V_t, \tag{5}$$

where $Z_t$ is the image patch of interest in the video frame $Y_t$,
$T$ is an affine transformation that normalizes the image to the same
size as the gallery images, and $V_t$ is the noise. The observation
equation is equivalently characterized by the likelihood
$p(Y_t \mid n_t, \theta_t) = p(Z_t \mid n_t)$. In the next section, we
define $p(Z_t \mid n_t)$ as mixtures of PPCA.

The essence of the approach is posterior probability computation,
i.e. computing $p(n_t, \theta_t \mid Y_{1:t})$, whose marginal
posterior probability $p(n_t \mid Y_{1:t})$ solves the classification
task and whose marginal posterior probability
$p(\theta_t \mid Y_{1:t})$ solves the tracking task.

Classification is based on a Maximum A Posteriori (MAP) decision rule,
namely finding the $n_t$ that maximizes $p(n_t \mid Y_{1:t})$. The
Sequential Importance Sampling (SIS) method (Liu and Chen, 1998) is
used to approximate and propagate the posterior probability
$p(n_t, \theta_t \mid Y_{1:t})$, and marginalization over the variable
$\theta_t$ is carried out before applying the classification rule.
Detailed descriptions can be found in Zhou et al. (2004).

Figure 2: Image frames from the video sequence ‘bmp1’, showing (a) the
side view and (b) the frontal view of the vehicle.

Figure 3: Detection result for ‘brdm’. (a) is the original image and
(b) is the detected target chip. It shows how the smoke emitted by the
vehicle affects the detection result.
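
One importance-sampling update for the joint state
$(n_t, \theta_t)$ can be sketched as below. This is a simplified
illustration and not the algorithm of Zhou et al. (2004): each
particle is assumed to carry its own identity label, no resampling is
performed, and `log_likelihood` is a hypothetical callable standing in
for $\log p(Z_t \mid n_t)$ evaluated on the patch selected by
$\theta_t$.

```python
import numpy as np

def sis_step(identities, thetas, weights, log_likelihood, noise_std, rng):
    """One importance-sampling update for particles (n, theta)."""
    # System equation (4): identity stays fixed, motion is a random walk.
    thetas = thetas + rng.normal(0.0, noise_std, size=thetas.shape)
    # Observation equation (5), expressed as a per-particle log-likelihood.
    loglik = np.array([log_likelihood(n, th)
                       for n, th in zip(identities, thetas)])
    with np.errstate(divide="ignore"):
        log_w = np.log(weights) + loglik
    log_w -= log_w.max()            # stabilise before exponentiating
    weights = np.exp(log_w)
    weights /= weights.sum()
    return thetas, weights

def identity_posterior(identities, weights, num_classes):
    """Marginalise theta: p(n | Y_1:t) is the weight mass per identity."""
    return np.bincount(identities, weights=weights, minlength=num_classes)
```

The MAP decision is then the index of the largest entry returned by
`identity_posterior`, and the weighted mean of `thetas` gives a
tracking estimate.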


3.2. Mixtures of Probabilistic PCA

Subspace analysis techniques have attracted growing interest in
computer vision research. In particular, eigenvector decomposition has
been shown to be an effective tool for solving problems by using a
low-dimensional vector to represent a high-dimensional vector. Here we
follow Tipping and Bishop (1999) for the mixtures of PPCA.

Given a set of $m$ by $n$ images $\{Z_i\}$, we form a set of vectors
$\{t_i\}$, where $t_i \in \mathbb{R}^{d}$ with $d = mn$ is the
lexicographic ordering of the pixel elements of image $Z_i$. For any
$t$ in $\{t_i\}$, we relate it to a corresponding $q$-dimensional
vector variable $x$ as

$$t = Wx + \mu + \epsilon, \tag{6}$$

where $d \gg q$ and $\mu$ is the mean of $t$.

For the case of isotropic noise $\epsilon \sim N(0, \sigma^2 I)$, the
distribution over $t$-space for a given $x$ is of the form

$$p(t \mid x) = (2\pi\sigma^2)^{-d/2} \exp\left\{-\frac{1}{2\sigma^2} \|t - Wx - \mu\|^2\right\}. \tag{7}$$

With a Gaussian prior for $x$, we obtain the marginal distribution of
$t$

$$p(t) = (2\pi)^{-d/2} |C|^{-1/2} \exp\left\{-\frac{1}{2} (t - \mu)^T C^{-1} (t - \mu)\right\}, \tag{8}$$

where the covariance is

$$C = \sigma^2 I + W W^T. \tag{9}$$
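
For illustration, the log of the marginal density in Eq. (8) can be
evaluated directly from the covariance in Eq. (9). The sketch below
assumes NumPy; the function name is hypothetical, and working in the
log domain is a choice made here to avoid underflow for large $d$.

```python
import numpy as np

def ppca_log_density(t, W, mu, sigma2):
    """log p(t) from Eq. (8), with C = sigma2 * I + W W^T from Eq. (9)."""
    d = mu.shape[0]
    C = sigma2 * np.eye(d) + W @ W.T
    diff = t - mu
    _, logdet = np.linalg.slogdet(C)         # log|C| without overflow
    maha = diff @ np.linalg.solve(C, diff)   # (t - mu)^T C^{-1} (t - mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
```

Because $C$ is $d \times d$, this direct form costs $O(d^3)$ per
model; that is acceptable for a sketch but could be reduced by
exploiting the low rank of $W W^T$.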


The mixtures of PPCA can model more complex data structures. The model
parameters are determined using maximum likelihood estimation. The
mixture model is defined as

$$p(t) = \sum_{i=1}^{M} \pi_i \, p(t \mid i), \tag{10}$$

where $p(t \mid i)$ is a single PPCA model and $\pi_i$ is the
corresponding mixing proportion, with $\pi_i \geq 0$ and
$\sum_{i=1}^{M} \pi_i = 1$. The three parameters $\mu_i$, $W_i$ and
$\sigma_i^2$ are now associated with each of the $M$ mixture
components. We use an iterative EM algorithm for estimating the model
parameters.
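
Given the log densities of the $M$ components for one vector $t$, the
mixture in Eq. (10) and the component posteriors used in an EM E-step
can be computed as in the following sketch. The function names are
hypothetical, and the inputs are assumed to be log densities of the
individual PPCA components.

```python
import numpy as np

def mixture_log_density(component_log_p, mixing):
    """log p(t) = log sum_i pi_i p(t|i), Eq. (10), via log-sum-exp."""
    a = np.asarray(component_log_p, dtype=float) + np.log(mixing)
    peak = a.max()
    return peak + np.log(np.exp(a - peak).sum())

def responsibilities(component_log_p, mixing):
    """Posterior probability of each component given t (EM E-step)."""
    a = np.asarray(component_log_p, dtype=float) + np.log(mixing)
    r = np.exp(a - a.max())
    return r / r.sum()
```

Subtracting the largest term before exponentiating keeps the result
finite even when every component density is far below the smallest
representable float.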


4. Tracking Evaluation

Most practical tracking systems fail in some situations, for example
because of illumination changes, pose variation or occlusion.
Automatic performance evaluation is therefore needed in these
applications. Fig. 5 shows the tracking result after running the
tracker for some time. The bounding box is so large that the tracker
has clearly failed. Hence, evaluation is necessary to terminate
tracking and restart the detection-tracking-classification cycle.

Our evaluation algorithm is based on measuring the appearance
similarity and tracking uncertainty. The following features are
examined in our evaluation:

1. Trace complexity $q_{tc}$: We define the trace complexity as the
ratio of the curve length to the straight-line length between the
target centroids in different frames.

2. Motion step $q_{ms}$: It is defined as the distance between the box
centers in two consecutive frames.

3. Scale change $q_{sc}$: To examine changes in object scale, we use
two cues. One is the ratio of the current area to the initial area;
the other is the scale change velocity.

4. Shape similarity $q_{ss}$: The change in the aspect ratio of the
bounding box also provides some information about the object shape. It
is defined as the ratio of the current aspect ratio to the initial
aspect ratio.

5. Appearance change $q_{ac}$: Three measures are used in our
algorithm. The first is the absolute pixel-by-pixel change between the
current frame and the initial frame, the second is the histogram
difference between the current frame and the initial frame, and the
last is related to the tracking algorithm over which the proposed
algorithm was tested.

Figure 4: Detection result for ‘m60’. (a) is the original image and
(b) is the detected target chip. It shows how contrast and SNR affect
the detection result.

Figure 5: An example of poor tracking.
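
The geometric features in items 1 to 4 can be sketched as follows.
This is an illustration only: the function names are hypothetical,
boxes are assumed to be `(x, y, w, h)` tuples with `(x, y)` the
top-left corner, the trace is measured from the first to the last
centroid supplied, and the scale change velocity and the appearance
measures of item 5 are omitted.

```python
import numpy as np

def box_center(box):
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def trace_complexity(centroids):
    """Path length divided by the straight-line distance between the ends."""
    c = np.asarray(centroids, dtype=float)
    path = np.linalg.norm(np.diff(c, axis=0), axis=1).sum()
    chord = np.linalg.norm(c[-1] - c[0])
    return float(path / chord) if chord > 0 else float("inf")

def motion_step(prev_box, box):
    """Distance between box centres in two consecutive frames."""
    return float(np.linalg.norm(box_center(box) - box_center(prev_box)))

def scale_ratio(initial_box, box):
    """Current box area divided by the initial box area."""
    return (box[2] * box[3]) / (initial_box[2] * initial_box[3])

def shape_ratio(initial_box, box):
    """Current aspect ratio (w / h) divided by the initial aspect ratio."""
    return (box[2] / box[3]) / (initial_box[2] / initial_box[3])
```

A smooth, straight track gives a trace complexity close to 1, whereas
a jittering track gives a larger value.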

To obtain a comprehensive measure of the tracking performance, we
combine the above five indicators. We first use empirical thresholds
to decide whether the tracker is uncertain according to each of the
five metrics, and then sum the five indicators using different weights
to arrive at a confidence measure $q$:

$$q = \sum_{j \in J} w_j \, \mathbf{1}[q_j < \lambda_j], \quad J = \{tc, ms, sc, ss, ac\}, \tag{11}$$

where $w_j$ and $\lambda_j$ are the corresponding weights and
thresholds for the evaluation. If the sum drops below some threshold,
we conclude that the tracking performance is poor and needs
re-initialization.
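
Eq. (11) amounts to a weighted vote over the five indicators, as in
the sketch below. The function names are hypothetical, the weights and
thresholds are supplied by the caller since their values are not given
here, and the default restart threshold of 0.5 is the value quoted in
Section 5.2.

```python
def tracking_confidence(indicators, thresholds, weights):
    """q = sum_j w_j * 1[q_j < lambda_j], Eq. (11), over the keys of `weights`."""
    return sum(w for j, w in weights.items()
               if indicators[j] < thresholds[j])

def needs_reinitialization(q, restart_threshold=0.5):
    """True when the confidence is too low to keep tracking."""
    return q < restart_threshold
```

With weights that sum to 1, $q$ lies in $[0, 1]$ and equals the total
weight of the indicators that stay within their thresholds.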

5. Experiments

In this section, we give details of our implementation. Training and
testing are described in the next two sections respectively. In our
experiment, the vehicle motion is characterized by
$\theta = (a_1, a_2, a_3, a_4, t_x, t_y)$, where
$\{a_1, a_2, a_3, a_4\}$ are the deformation parameters and
$(t_x, t_y)$ are the 2D translation parameters. By applying an affine
transformation using $\theta$ as parameters, we crop the region of
interest so that it has the same size as the still template in the
gallery and perform zero-mean-unit-variance normalization. The region
of interest is 24 x 30 in size.
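
The cropping and normalization step might look like the sketch below.
Several details are assumptions made for illustration: the mapping
$[x, y]^T = [[a_1, a_2], [a_3, a_4]] [u, v]^T + [t_x, t_y]^T$ from
patch-centred coordinates to image coordinates, nearest-neighbour
sampling with clamping at the image border, and reading 24 x 30 as 24
rows by 30 columns.

```python
import numpy as np

def crop_and_normalize(image, theta, out_shape=(24, 30)):
    """Sample the region described by theta and standardise the patch."""
    a1, a2, a3, a4, tx, ty = theta
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    u = xs - (w - 1) / 2.0           # patch-centred column coordinate
    v = ys - (h - 1) / 2.0           # patch-centred row coordinate
    x = a1 * u + a2 * v + tx         # assumed affine convention
    y = a3 * u + a4 * v + ty
    xi = np.clip(np.rint(x).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.rint(y).astype(int), 0, image.shape[0] - 1)
    patch = image[yi, xi].astype(np.float64)
    std = patch.std()
    patch = patch - patch.mean()     # zero mean
    return patch / std if std > 0 else patch   # unit variance
```

Flattening the returned patch with `patch.ravel()` gives a vector of
length 720 that can be used as $t$ in the appearance model.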

5.1. Training

We use one video sequence for each vehicle and obtain the tracking
result. Then we select 36 images for each vehicle in the gallery. The
pertinent parameters for the experiment are $M = 2$ and $q = 15$.

Fig. 6 shows the gallery of vehicle images. There are a total of 144
images in the gallery. From top to bottom they are ‘m60’, ‘brdm’,
‘wetting’ and ‘bmp’, each occupying three rows.

After we have the gallery images, we use mixtures of PPCA to estimate
the parameters $\pi_i$, $\mu_i$, $W_i$ and $\sigma_i^2$.

5.2. Testing

For each frame, we obtain the motion parameters from tracking and crop
the region of interest of size 24 x 30 from the original image. After
zero-mean and unit-variance normalization, we substitute the resulting
vector as $t$ into equation (10) and obtain the probabilities for each
vehicle. After normalization, we pick the vehicle with the highest
probability as our classification result. The probabilities propagate
to the next frame. In each frame, if the confidence measure is below
some threshold, detection restarts 20 frames before the drifting
point, and tracking and classification restart too.

Figure 6: Gallery of vehicle images. The image size is 24 by 30.
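
One plausible reading of how the class probabilities propagate from
frame to frame is a recursive Bayesian update, sketched below. This
interpretation, the function names and the small floor that keeps a
class from being ruled out permanently are assumptions for
illustration; the label order follows the gallery order.

```python
import numpy as np

LABELS = ("m60", "brdm", "wetting", "bmp")

def update_class_posterior(prior, frame_log_likelihoods, floor=1e-12):
    """Multiply the previous posterior by the new likelihoods and renormalise."""
    prior = np.maximum(np.asarray(prior, dtype=float), floor)
    log_post = np.log(prior) + np.asarray(frame_log_likelihoods, dtype=float)
    log_post -= log_post.max()       # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def classify(posterior, labels=LABELS):
    """MAP decision: label with the highest posterior probability."""
    return labels[int(np.argmax(posterior))]
```

Starting from a uniform prior, calling `update_class_posterior` once
per frame with the log of Eq. (10) for each vehicle yields the
normalized probabilities from which the label is picked.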

Fig. 7 shows the tracking and recognition results for ‘wetting1’ and
Fig. 8 those for ‘bmp1’. In Fig. 7(a), the top image is the tracking
result for the current frame. In each frame we draw a bounding box
around the tracked vehicle, with a different color for each vehicle.
The bottom left image is the classification score, which is the
probability of seeing each vehicle in the video, plotted from the
first frame to the current frame. The bottom right image is the
tracking confidence measure, which represents the probability that the
tracking result is correct. We restart detection and tracking if the
measure falls below the threshold of 0.5. The same description applies
to Fig. 7(b) and Fig. 8.

From Fig. 7, we observe that the recognition result for ‘wetting1’ is
very good because a high probability is associated with ‘wetting’
(dotted blue line) on almost every frame. There are several peaks and
valleys in the dotted blue line due to the re-initialization of the
tracking, and the evaluation probability on the right drops very
quickly at the corresponding frames. In Fig. 8, the recognition of
‘bmp1’ is confused with ‘brdm’ for the first half of the sequence.
Because of the smoke emitted by ‘bmp1’, it is very hard for the
detection algorithm to produce a tight initial bounding box. Given
this initial location, the tracker drifts away after about 40 frames.
At frame 99, the result is incorrect, as it gives ‘brdm’ as the
recognition result. The result becomes stable and correct after 400
frames. Over the whole video sequence, the recognition result is quite
good, and we classify the tracked vehicle as ‘bmp’, which is the
correct result.

Table 1: Confusion matrix for vehicle classification experiment.

The results of the experiment are summarized in Table 1. Each number
in a row is the recognition percentage of the vehicle. Taking the
second row as an example, 93.82% of the whole sequence recognizes the
vehicle as ‘bmp’, while 3.17% as ‘brdm’ and 3.01% as ‘bmp’. No frame
recognizes it as ‘wetting’. The elements on the diagonal give the
correct recognition score for our experiment. The overall accuracy of
the recognition is 89.07%.


6. Conclusion and Future Work

In this paper, we have proposed an approach for vehicle classification
by integrating detection, tracking and recognition. The experimental
results demonstrate the robustness and effectiveness of our method.

Our future work will include improving the detection, tracking and
evaluation algorithms and developing a more robust and stable
recognition algorithm. Larger data sets will also be tested to obtain
a more general analysis.

Figure 7: Tracking and recognition results for ‘wetting1’. The results
are from frame 1 to 99 for (a) and frame 1 to 799 for (b). The top
panel shows the original image and tracking result, the bottom left
panel shows the recognition density $p(n_t \mid Y_{1:t})$, and the
bottom right panel shows the tracking confidence $q$.

Figure 8: Tracking and recognition results for ‘bmp1’. The results are
from frame 1 to 99 for (a) and frame 1 to 830 for (b).